Asymptotic equivalence theory has so far been developed in the literature
only for bounded loss functions. This limits the potential applications
of the theory because many commonly used loss functions in
statistical inference are unbounded. In this paper we develop asymp- 
totic equivalence results for robust nonparametric regression with 
unbounded loss functions. The results imply that all the Gaussian 
nonparametric regression procedures can be robustified in a unified 
way. A key step in our equivalence argument is to bin the data and 
then take the median of each bin. 

The asymptotic equivalence results have significant practical im- 
plications. To illustrate the general principles of the equivalence argu- 
ment we consider two important nonparametric inference problems: 
robust estimation of the regression function and the estimation of a 
quadratic functional. In both cases easily implementable procedures 
are constructed and are shown to enjoy simultaneously a high degree 
of robustness and adaptivity. Other problems such as construction of 
confidence sets and nonparametric hypothesis testing can be handled 
in a similar fashion. 

1. Introduction. The main goal of the asymptotic equivalence theory is 
to approximate general statistical models by simple ones. If a complex model 
is asymptotically equivalent to a simple model, then all asymptotically op- 
timal procedures can be carried over from the simple model to the complex 
one for bounded loss functions and the study of the complex model is then 
essentially simplified. Early work on asymptotic equivalence theory was fo- 
cused on the parametric models and the equivalence is local. See Le Cam 
(1986). 

There have been important developments in the asymptotic equivalence
theory for nonparametric models in the last decade or so. In particular, 
global asymptotic equivalence theory has been developed for nonparametric 
regression in Brown and Low (1996b) and Brown et al. (2002), nonpara- 
metric density estimation models in Nussbaum (1996) and Brown et al. 
(2004), generalized linear models in Grama and Nussbaum (1998), nonpara- 
metric autoregression in Milstein and Nussbaum (1989), diffusion models in 
Delattre and Hoffmann (2002) and Genon-Catalot, Laredo and Nussbaum 
(2002), GARCH model in Wang (2002) and Brown, Wang and Zhao (2003), 
and spectral density estimation in Golubev, Nussbaum and Zhou (2009). 

So far all the asymptotic equivalence results developed in the literature are 
only for bounded loss functions. However, for many statistical applications, 
asymptotic equivalence under bounded losses is not sufficient because many 
commonly used loss functions in statistical inference such as squared error 
loss are unbounded. As commented by Johnstone (2002) on the asymptotic 
equivalence results: "Some cautions are in order when interpreting these
results.... Meaningful error measures... may not translate into, say, squared
error loss in the Gaussian sequence model."

In this paper we develop asymptotic equivalence results for robust non- 
parametric regression with an unknown symmetric error distribution for 
unbounded loss functions which include, for example, the commonly used 
squared error and integrated squared error losses. Consider the
nonparametric regression model

(1)    $$Y_i = f(i/n) + \xi_i, \qquad i = 1, \ldots, n,$$

where the errors $\xi_i$ are independent and identically distributed with some
density $h$. The error density $h$ is assumed to be symmetric with median 0,
but otherwise unknown. Note that for some heavy-tailed distributions such
as the Cauchy distribution the mean does not even exist. We thus do not assume
the existence of the mean here. One is often interested in robustly
estimating the regression function $f$ or some functionals of $f$. These problems have
been well studied in the case of Gaussian errors. In the present paper we
introduce a unified approach to turn the general nonparametric regression
model (1) into a standard Gaussian regression model, so that in principle
any procedure for Gaussian nonparametric regression can be applied.
More specifically, with properly chosen $T$ and $m$, we propose to divide the
observations into $T$ bins of size $m$ and then take the median $X_j$ of the
observations in the $j$th bin for $j = 1, \ldots, T$. The asymptotic equivalence
results developed in Section 2 show that under mild regularity conditions, for
a wide collection of error distributions the experiment of observing the
medians $\{X_j : j = 1, \ldots, T\}$ is in fact asymptotically equivalent to the standard
Gaussian nonparametric regression model

(2)    $$X_j^{**} = f(j/T) + \frac{1}{2h(0)\sqrt{m}} Z_j, \qquad Z_j \overset{\text{i.i.d.}}{\sim} N(0,1), \quad j = 1, \ldots, T,$$

for a large class of unbounded losses. Detailed arguments are given in Section
2.

We develop the asymptotic equivalence results for the general regression 
model (1) by first extending the classical formulation of asymptotic equiva- 
lence in Le Cam (1964) to accommodate unbounded losses. The asymptotic 
equivalence result has significant practical implications. It implies that all 
statistical procedures for any asymptotic decision problem in the setting of 
the Gaussian nonparametric regression can be carried over to solve problems 
in the general nonparametric regression model (1) for a class of unbounded 
loss functions. In other words, all the Gaussian nonparametric regression 
procedures can be robustified in a unified way. We illustrate the applica- 
tions of the general principles in two important nonparametric inference 
problems under the model (1): robust estimation of the regression function 
$f$ under integrated squared error loss and the estimation of the quadratic
functional $Q(f) = \int f^2$ under squared error.

As we demonstrate in Sections 3 and 4 the key step in the asymptotic 
equivalence theory, binning and taking the medians, can be used to construct 
simple and easily implementable procedures for estimating the regression 
function $f$ and the quadratic functional $\int f^2$. After obtaining the medians
of the binned data, the general model (1) with an unknown symmetric error 
distribution is turned into a familiar Gaussian regression model, and then a 
Gaussian nonparametric regression procedure can be applied. In Section 3 
we choose to employ a blockwise James-Stein wavelet estimator, BlockJS, 
for the Gaussian regression problem because of its desirable theoretical and 
numerical properties. See Cai (1999). The robust wavelet regression proce- 
dure has two main steps: 

1. Binning and taking median of the bins. 

2. Applying the BlockJS procedure to the medians. 
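
The first of the two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `bin_medians` and `cauchy_noise`, the sine test signal and the random seed are hypothetical choices made here, and only the sample size 4096 and bin size 8 are taken from the caption of Fig. 1. Step 2 would then pass the resulting medians to a Gaussian regression procedure such as BlockJS.

```python
import math
import random
import statistics


def bin_medians(y, bin_size):
    """Split y into consecutive bins of length bin_size; return each bin's median."""
    if bin_size <= 0 or len(y) % bin_size != 0:
        raise ValueError("len(y) must be a positive multiple of bin_size")
    return [statistics.median(y[start:start + bin_size])
            for start in range(0, len(y), bin_size)]


def cauchy_noise(rng):
    """Draw one standard Cauchy variate by inverting its distribution function."""
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(0)              # fixed seed, chosen only for reproducibility
n, m = 4096, 8                      # sample size and bin size quoted for Fig. 1
signal = [math.sin(2 * math.pi * i / n) for i in range(1, n + 1)]  # assumed test signal
y = [s + cauchy_noise(rng) for s in signal]

x = bin_medians(y, m)               # T = n / m medians, one per bin
print(len(x), max(abs(v) for v in y), max(abs(v) for v in x))
```

Because each median lies between the smallest and largest value in its bin, the extreme outliers of the Cauchy sample cannot propagate to the medians, which is the robustness the procedure relies on.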

The procedure is shown to achieve four objectives simultaneously: robust- 
ness, global adaptivity, spatial adaptivity, and computational efficiency. The- 
oretical results in Section 3.2 show that the estimator achieves optimal global 
adaptation for a wide range of Besov balls as well as a large collection of 
error distributions. In addition, it attains the local adaptive minimax rate 
for estimating functions at a point. Figure 1 compares a direct wavelet es- 
timate with our robust estimate in the case of Cauchy noise. The example 
illustrates the fact that direct application of a wavelet regression procedure 
designed for Gaussian noise may not work at all when the noise is in fact
heavy-tailed. On the other hand, our robust procedure performs well even
in Cauchy noise.

Fig. 1. Left panel: spikes signal with Cauchy noise. Middle panel: an estimate obtained
by applying directly a wavelet procedure to the original noisy signal. Right panel: a robust
estimate obtained by applying a wavelet block thresholding procedure to the medians of the
binned data. Sample size is 4096 and bin size is 8.

In Section 4 we construct a robust procedure for estimating the quadratic 
functional $Q(f) = \int f^2$ following the same general principles. Other problems
such as construction of confidence sets and nonparametric hypothesis testing 
can be handled in a similar fashion. 

Key technical tools used in our development are an improved moder- 
ate deviation result for the median statistic and a better quantile coupling 
inequality. Median coupling has been considered in Brown, Cai and Zhou 
(2008). For the asymptotic equivalence results given in Section 2 and the
proofs of the theoretical results in Section 3 we need a moderate deviation
result for the median and a coupling inequality that are sharper than
those given in Brown, Cai and Zhou (2008). These improvements play a crucial
role in this paper for establishing the asymptotic equivalence as well as
robust and adaptive estimation results. The results may be of independent 
interest for other statistical applications. 

The paper is organized as follows. Section 2 develops an asymptotic
equivalence theory for unbounded loss functions. To illustrate the general
principles of the asymptotic equivalence theory, we then consider robust
estimation of the regression function $f$ under integrated squared error in Section
3 and estimation of the quadratic functional $\int f^2$ under squared error in
Section 4. The two estimators are easily implementable and are shown to 
enjoy desirable robustness and adaptivity properties. In Section 5 we de- 
rive a moderate deviation result for the medians and a quantile coupling 
inequality. The proofs are contained in Section 6. 

2. Asymptotic equivalence. This section develops an asymptotic equiv- 
alence theory for unbounded loss functions. The results reduce the gen- 
eral nonparametric regression model (1) to a standard Gaussian regression 
model. 

The Gaussian nonparametric regression has been well studied and it of- 
ten serves as a prototypical model for more general nonparametric function 
estimation settings. A large body of literature has been developed for min- 
imax and adaptive estimation in the Gaussian case. These results include 
optimal convergence rates and optimal constants. See, for example, Pinsker 
(1980), Korostelev (1993), Donoho et al. (1995), Johnstone (2002), Tsybakov 
(2004), Cai and Low (2005, 2006b) and references therein for various esti- 
mation problems under various loss functions. The asymptotic equivalence 
results established in this section can be used to robustify these procedures 
in a unified way to treat the general nonparametric regression model (1). 

We begin with a brief review of the classical formulation of asymptotic 
equivalence and then generalize it to accommodate unbounded losses. 

2.1. Classical asymptotic equivalence theory. Le Cam (1986) developed a 
general theory for asymptotic decision problems. At the core of this theory 
is the concept of a distance between statistical models (or experiments), 
called Le Cam's deficiency distance. The goal is to approximate general 
statistical models by simple ones. If a complex model is close to a simple 
model in Le Cam's distance, then there is a mapping of solutions to decision 
theoretic problems from one model to the other for all bounded loss functions. 
Therefore the study of the complex model can be reduced to the one for the 
simple model. 

A family of probability measures $E = \{P_\theta : \theta \in \Theta\}$ defined on the same
$\sigma$-field of a sample space $\Omega$ is called a statistical model (or experiment).
Le Cam (1964) defined a distance $\Delta(E,F)$ between $E$ and another model
$F = \{Q_\theta : \theta \in \Theta\}$ with the same parameter set by the means of
"randomizations." Suppose one would like to approximate $E$ by a simpler model
$F$. An observation $x$ in $E$ can be mapped into the sample space of $F$ by
generating an "observation" $y$ according to a Markov kernel $K_x$, which is a
probability measure on the sample space of $F$. Suppose $x$ is sampled from
$P_\theta$. Write $KP_\theta$ for the distribution of $y$ with $KP_\theta(A) = \int K_x(A)\,dP_\theta$ for a
measurable set $A$. The deficiency $\delta$ of $E$ with respect to $F$ is defined as the
smallest possible value of the total variation distance between $KP_\theta$ and $Q_\theta$
among all possible choices of $K$, that is,

$$\delta(E,F) = \inf_K \sup_{\theta \in \Theta} |KP_\theta - Q_\theta|_{TV}.$$

See Le Cam (1986, page 3) for further details. The deficiency $\delta$ of $E$ with
respect to $F$ can be explained in terms of risk comparison. If $\delta(E,F) < \varepsilon$
for some $\varepsilon > 0$, it is easy to see that for every procedure $\tau$ in $F$ there exists
a procedure $\xi$ in $E$ such that $R(\theta;\xi) \le R(\theta;\tau) + 2\varepsilon$ for every $\theta \in \Theta$ and
any loss function with values in the unit interval. The converse is also true.
Symmetrically one may consider the deficiency of $F$ with respect to $E$ as
well. Le Cam's deficiency distance between the models $E$ and $F$ is then
defined as

(3)    $$\Delta(E,F) = \max(\delta(E,F), \delta(F,E)).$$

For bounded loss functions, if A(E,F) is small, then to every statistical 
procedure for E there is a corresponding procedure for F with almost the 
same risk function and vice versa. Two sequences of experiments $E_n$ and $F_n$
are called asymptotically equivalent if $\Delta(E_n,F_n) \to 0$ as $n \to \infty$. The
significance of asymptotic equivalence is that all asymptotically optimal statistical
procedures can be carried over from one experiment to the other for bounded 
loss functions. 
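
The risk comparison behind the deficiency can be checked numerically in a toy case. The sketch below is illustrative only: the three-point distributions `kp` and `q` and the loss vector are made-up numbers standing in for $KP_\theta$ and $Q_\theta$ at one fixed $\theta$, and total variation is computed as half the $\ell_1$ distance, which is an assumed convention.

```python
def total_variation(p, q):
    """Total variation distance sup_A |P(A) - Q(A)| between two discrete laws."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def expected_loss(p, loss):
    """Risk of a fixed procedure: the loss averaged over the distribution p."""
    return sum(pi * li for pi, li in zip(p, loss))


kp = [0.50, 0.30, 0.20]      # hypothetical law of the randomized observation
q = [0.45, 0.35, 0.20]       # hypothetical law in the simpler experiment
loss = [0.0, 1.0, 0.4]       # any loss with values in the unit interval

eps = total_variation(kp, q)
gap = abs(expected_loss(kp, loss) - expected_loss(q, loss))
print(eps, gap)
```

For any loss with values in the unit interval the risk gap cannot exceed twice the total variation distance, which is the mechanism that fails once the loss is allowed to grow with $n$.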

2.2. Extension of the classical asymptotic equivalence formulation. For 
many statistical applications, asymptotic equivalence under bounded losses 
is not sufficient because many commonly used loss functions are unbounded. 
Let $E_n = \{P_{\theta,n} : \theta \in \Theta\}$ and $F_n = \{Q_{\theta,n} : \theta \in \Theta\}$ be two asymptotically
equivalent models in Le Cam's sense. Suppose that the model $F_n$ is simpler and
well studied and a sequence of estimators $\hat\theta_n$ satisfies

$$E_{Q_{\theta,n}}\, n^r d(\hat\theta_n, \theta) \to c \qquad \text{as } n \to \infty,$$

where $d$ is a distance between $\hat\theta_n$ and $\theta$, and $r, c > 0$ are constants. This
implies that $\theta$ can be estimated by $\hat\theta_n$ under the distance $d$ with a rate
$n^{-r}$. Examples include $E_{Q_{\theta,n}} n(\hat\theta_n - \theta)^2 \to c$ in many parametric estimation
problems, and $E_{Q_{f,n}} n^r \int (\hat f - f)^2\,d\mu \to c$, where $f$ is an unknown function
and $0 < r < 1$, in many nonparametric estimation problems. The asymptotic
equivalence between $E_n$ and $F_n$ in the classical sense does not imply that
there is an estimator $\hat\theta^*$ in $E_n$ such that

$$E_{P_{\theta,n}}\, n^r d(\hat\theta^*, \theta) \to c.$$

In this setting the loss function is actually $L(\hat\theta, \theta) = n^r d(\hat\theta, \theta)$ which grows
as $n$ increases, and is usually unbounded.

In this section we introduce a new asymptotic equivalence formulation to
handle unbounded losses. Let $\Lambda_E$ and $\Lambda_F$ be sets of procedures for $E$ and
$F$, respectively. Let $\Gamma$ be a set of loss functions. We define the deficiency
distance $\Delta(E, F; \Gamma, \Lambda_E, \Lambda_F)$ as follows.

Definition 1. Define $\delta(E, F; \Gamma, \Lambda_E, \Lambda_F) = \inf\{\varepsilon \ge 0$: for every
procedure $\tau \in \Lambda_F$ there exists a procedure $\xi \in \Lambda_E$ such that $R(\theta;\xi) \le R(\theta;\tau) + 2\varepsilon$
for every $\theta \in \Theta$ for any loss function $L \in \Gamma\}$. Then the deficiency distance
between models $E$ and $F$ for the loss class $\Gamma$ and procedure classes $\Lambda_E$ and $\Lambda_F$
is defined as $\Delta(E, F; \Gamma, \Lambda_E, \Lambda_F) = \max\{\delta(E, F; \Gamma, \Lambda_E, \Lambda_F), \delta(F, E; \Gamma, \Lambda_F, \Lambda_E)\}$.

In other words, if the deficiency $\Delta(E, F; \Gamma, \Lambda_E, \Lambda_F)$ is small, then to every
statistical procedure for one experiment, there is a corresponding procedure
for the other experiment with almost the same risk function for losses $L \in \Gamma$
and procedures in $\Lambda$.

Definition 2. Two sequences of experiments $E_n$ and $F_n$ are called
asymptotically equivalent with respect to the sets of procedures $\Lambda_{E_n}$ and
$\Lambda_{F_n}$ and set of loss functions $\Gamma_n$ if $\Delta(E_n, F_n; \Gamma_n, \Lambda_{E_n}, \Lambda_{F_n}) \to 0$ as $n \to \infty$.

If $E_n$ and $F_n$ are asymptotically equivalent, then all asymptotically
optimal statistical procedures in $\Lambda_{F_n}$ can be carried over to $E_n$ for loss
functions $L \in \Gamma_n$ with essentially the same risk. The definitions here generalize
the classical asymptotic equivalence formulation, which corresponds to the
special case with $\Gamma$ being the set of loss functions with values in the unit
interval.

For most statistical applications the loss function is bounded by a certain 
power of n. We now give a sufficient condition for the asymptotic equivalence 
under such losses. Suppose that we estimate $f$ or a functional of $f$ under
a loss $L$. Let $p_{f,n}$ and $q_{f,n}$ be the density functions, respectively, for $E_n$
and $F_n$. Note that in the classical formulation of asymptotic equivalence for
bounded losses, the deficiency of $E_n$ with respect to $F_n$ goes to zero if there
is a Markov kernel $K$ such that

(4)    $$\sup_f |KP_{f,n} - Q_{f,n}|_{TV} \to 0.$$

For unbounded losses the condition (4) is no longer sufficient to guarantee
that the deficiency goes to zero. Let $p^*_{f,n}$ and $q_{f,n}$ be the density functions
of $KP_{f,n}$ and $Q_{f,n}$, respectively. Let $\varphi(f)$ be an estimand, which can be $f$
or a functional of $f$. Suppose that in $F_n$ there is an estimator $\hat\varphi(f)_q$ of $\varphi(f)$
such that

$$\int L(\hat\varphi(f)_q, \varphi(f))\, q_{f,n} \to c.$$

We would like to derive sufficient conditions under which there is an
estimator $\hat\varphi(f)_p$ in $E_n$ such that

$$\int L(\hat\varphi(f)_p, \varphi(f))\, p_{f,n} \le c(1 + o(1)).$$

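Note that if $\hat\varphi(f)_p$ is constructed by mapping over $\hat\varphi(f)_q$ via a Markov kernel
$K$, then

$$\begin{aligned}
E L(\hat\varphi(f)_p, \varphi(f)) &= \int L(\hat\varphi(f)_q, \varphi(f))\, p^*_{f,n} \\
&= \int L(\hat\varphi(f)_q, \varphi(f))\, q_{f,n} + \int L(\hat\varphi(f)_q, \varphi(f))\,(p^*_{f,n} - q_{f,n}).
\end{aligned}$$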
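Let $A_n = \{|p^*_{f,n}/q_{f,n} - 1| < \varepsilon_n\}$ for some $\varepsilon_n \to 0$, and write

$$\begin{aligned}
\int L(\hat\varphi(f)_q, \varphi(f))\,(p^*_{f,n} - q_{f,n})
&= \int L(\hat\varphi(f)_q, \varphi(f))\, q_{f,n}\,(p^*_{f,n}/q_{f,n} - 1)\,[I(A_n) + I(A_n^c)] \\
&\le \int L(\hat\varphi(f)_q, \varphi(f))\, q_{f,n}\,(p^*_{f,n}/q_{f,n} - 1)\, I\{p^*_{f,n}/q_{f,n} \ge 1\}\,[I(A_n) + I(A_n^c)] \\
&\le \varepsilon_n \int L(\hat\varphi(f)_q, \varphi(f))\, q_{f,n}\, I(A_n) + \int L(\hat\varphi(f)_q, \varphi(f))\, p^*_{f,n}\, I(A_n^c).
\end{aligned}$$

If $KP_{f,n}(A_n^c)$ decays exponentially fast uniformly over $\mathcal{F}$ and $L$ is bounded
by a polynomial of $n$, this formula implies that

$$\int L(\hat\varphi(f)_q, \varphi(f))\,(q_{f,n} - p^*_{f,n}) = o(1).$$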

Assumption (A0). For each estimand $\varphi(f)$, each estimator $\hat\varphi(f) \in \Lambda_n$
and each $L \in \Gamma_n$, there is a constant $M > 0$, independent of the loss function
and the procedure, such that $L(\hat\varphi(f), \varphi(f)) \le M n^M$.

The following result summarizes the above discussion and gives a sufficient
condition for the asymptotic equivalence for the set of procedures $\Lambda_n$ and
set of loss functions $\Gamma_n$.

Proposition 1. Let $E_n = \{P_{\theta,n} : \theta \in \Theta\}$ and $F_n = \{Q_{\theta,n} : \theta \in \Theta\}$ be two
models. Suppose there is a Markov kernel $K$ such that $KP_{\theta,n}$ and $Q_{\theta,n}$ are
defined on the same $\sigma$-field of a sample space. Let $p^*_{f,n}$ and $q_{f,n}$ be the density
functions of $KP_{f,n}$ and $Q_{f,n}$ w.r.t. a dominating measure. If, for a
sequence $\varepsilon_n \to 0$,

$$\sup_f KP_{f,n}\bigl(|p^*_{f,n}/q_{f,n} - 1| \ge \varepsilon_n\bigr) \le C_D n^{-D}$$

for all $D > 0$, then $\delta(E_n, F_n; \Gamma_n, \Lambda_{E_n}, \Lambda_{F_n}) \to 0$ as $n \to \infty$ under
Assumption (A0).
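
To see how the two conditions enter, the following worked bound spells out the two terms left by the decomposition over $A_n = \{|p^*_{f,n}/q_{f,n} - 1| < \varepsilon_n\}$ and its complement. It is a sketch under the stated assumptions only: it takes the risk in $F_n$ to converge to $c$, uses the polynomial bound $L \le M n^M$ of Assumption (A0), and fixes some $D > M$, a choice made here for illustration.

```latex
\begin{align*}
\varepsilon_n \int L(\hat\varphi(f)_q, \varphi(f))\, q_{f,n}\, I(A_n)
  &\le \varepsilon_n \int L(\hat\varphi(f)_q, \varphi(f))\, q_{f,n}
   = \varepsilon_n\,(c + o(1)) \to 0, \\
\int L(\hat\varphi(f)_q, \varphi(f))\, p^*_{f,n}\, I(A_n^c)
  &\le M n^{M}\, KP_{f,n}(A_n^c)
   \le M\, C_D\, n^{M-D} \to 0 \qquad \text{for any fixed } D > M.
\end{align*}
```

Both terms vanish uniformly in $f$, so the risk of the mapped estimator in $E_n$ is at most $c(1 + o(1))$.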

Examples of loss functions include 

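$$L(\hat f_n, f) = n^{2\alpha/(2\alpha+1)} \int (\hat f_n - f)^2 \qquad \text{and} \qquad L(\hat f_n, f) = n^{2\alpha/(2\alpha+1)} \int \bigl(\sqrt{\hat f_n} - \sqrt{f}\bigr)^2$$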

for estimating $f$ and $L(\hat f_n, f) = n^{2\alpha/(2\alpha+1)} (\hat f_n(t_0) - f(t_0))^2$ for estimating $f$
at a fixed point $t_0$ where $\alpha$ is the smoothness of $f$, as long as we require
$\hat f_n$ to be bounded by a power of $n$. If the maximum of $\hat f_n$ or $\hat f_n(t_0)$ grows
faster than a polynomial of $n$, we commonly obtain a better estimate by
truncation, for example, defining a new estimate $\min(\hat f_n, n^2)$.

The above discussions suggest that we may study a broad range of loss 
functions under a mild restriction on procedures. In comparison to the clas- 
sic framework of asymptotic equivalence, here the collection of loss functions 
is much broadened to include unbounded losses while the collection of pro- 
cedures is slightly more restrictive to only include those with losses bounded 
by a polynomial power of n. Virtually all practical procedures satisfy this 
condition. Of course in our formulation if $\Gamma_n$ is set to be the collection of
bounded loss functions, then the procedure can be any measurable function.

2.3. Asymptotic equivalence for robust estimation under unbounded losses. 
We now return to the nonparametric regression model (1) and denote the
model by $E_n$,

$$E_n: Y_i = f(i/n) + \xi_i, \qquad i = 1, \ldots, n.$$

An asymptotic equivalence theory for nonparametric regression with a known 
error distribution has been developed in Grama and Nussbaum (2002), but 
the Markov kernel (randomization) there was not given explicitly, and so 
it is not implementable. In this section we propose an explicit and easily 
implementable procedure to reduce the nonparametric regression with an 
unknown error distribution to a Gaussian regression. We begin by dividing
the interval [0, 1] into $T$ equal-length subintervals. Without loss of
generality, we shall assume that $n$ is divisible by $T$, and let $m = n/T$, the number
of observations in each bin. We then take the median $X_j$ of the observations
in each bin, that is,

$$X_j = \mathrm{median}\{Y_i, (j-1)m + 1 \le i \le jm\},$$

and make statistical inferences based on the median statistics $\{X_j\}$. Let
$F_n$ be the experiment of observing $\{X_j, 1 \le j \le T\}$. In this section we shall
show that $F_n$ is in fact asymptotically equivalent to the following Gaussian
experiment:

$$G_n: X_j^{**} = f(j/T) + \frac{1}{2h(0)\sqrt{m}} Z_j, \qquad Z_j \overset{\text{i.i.d.}}{\sim} N(0,1), \quad 1 \le j \le T,$$

under mild regularity conditions. The asymptotic equivalence is established
in two steps. 

Suppose the function $f$ is smooth. Then $f$ is locally approximately
constant. We define a new experiment to approximate $E_n$ as follows:

$$E_n^*: Y_i^* = f^*(i/n) + \xi_i, \qquad 1 \le i \le n,$$

where $f^*(i/n) = f(\lceil iT/n \rceil / T)$. For each of the $T$ subintervals, there are $m$
observations centered around the same mean.

For the experiment $E_n^*$ we bin the observations $Y_i^*$ and then take the
medians in exactly the same way and let $X_j^*$ be the median of the $Y_i^*$'s in
the $j$th subinterval. If $E_n^*$ approximates $E_n$ well, the statistical properties
of $X_j^*$ are then similar to those of $X_j$. Let $\eta_j$ be the median of the corresponding errors $\xi_i$
in the $j$th bin. Note that the median $X_j^*$ has a very simple form:

$$F_n^*: X_j^* = f(j/T) + \eta_j, \qquad 1 \le j \le T.$$

Theorem 6 in Section 5 shows that $\eta_j$ can be well approximated by a normal
variable with mean 0 and variance $\frac{1}{4mh^2(0)}$, which suggests that $F_n^*$ is close
to the experiment $G_n$.
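
The normal approximation to the median of the errors in a bin can be checked by simulation. The following sketch assumes standard Cauchy errors, for which $h(0) = 1/\pi$, and compares the Monte Carlo variance of the median of $m$ errors with $1/(4mh^2(0))$. The bin size, replication count, seed and the helper name `cauchy_median` are arbitrary choices made here, not values from the paper.

```python
import math
import random
import statistics


def cauchy_median(rng, m):
    """Median of m independent standard Cauchy draws (inverse-CDF sampling)."""
    return statistics.median(
        math.tan(math.pi * (rng.random() - 0.5)) for _ in range(m))


rng = random.Random(1)              # fixed seed for reproducibility
m, reps = 101, 4000                 # illustrative bin size and replications
medians = [cauchy_median(rng, m) for _ in range(reps)]

h0 = 1.0 / math.pi                  # standard Cauchy density at its median 0
predicted_var = 1.0 / (4 * m * h0 ** 2)
empirical_var = statistics.pvariance(medians)
print(f"predicted {predicted_var:.5f}, empirical {empirical_var:.5f}")
```

Although the Cauchy errors themselves have no mean or variance, the medians have a finite variance close to the predicted value, which is what allows the binned medians to be treated as approximately Gaussian observations.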

We formalize the above heuristics in the following theorems. We first
introduce some conditions. We shall choose $T = n^{2/3}/\log n$ and assume that
$f$ is in a Hölder ball,

(5)    $$f \in \mathcal{F} = \{f : |f(y) - f(x)| \le M|x - y|^d\}, \qquad d > 3/4.$$



Assumption (Al). Let ^ be a random variable with density function 
h. Define r a (^) = log an< i / i ( a ) = Assume that 

(6)    $$\mu(a) \le Ca^2,$$

(7)    $$E\exp[t(r_a(\xi) - \mu(a))] \le \exp(Ct^2a^2),$$

for $0 < |a| < \varepsilon$ and $0 < |ta| < \varepsilon$ for some $\varepsilon > 0$. Equation (7) is roughly
equivalent to $\mathrm{Var}(r_a(\xi)) \le Ca^2$. Assumption (A1) is satisfied by many distributions
including Cauchy and Gaussian.

The following asymptotic equivalence result implies that any procedure
based on $X_j$ has exactly the same asymptotic risk as a similar procedure by
just replacing $X_j$ by $X_j^*$. That is, the experiments $F_n$ and $F_n^*$ are
asymptotically equivalent.

Theorem 1. Under Assumptions (A0) and (A1) and the Hölder condition (5),
the two experiments $E_n$ and $E_n^*$ are asymptotically equivalent with
respect to the set of procedures $\Lambda_n$ and set of loss functions $\Gamma_n$.

The following asymptotic equivalence result implies that asymptotically
there is no need to distinguish the $X_j^*$'s from the Gaussian random variables
$X_j^{**}$'s. We need the following assumptions on the density function $h(x)$ of
$\xi$.

Assumption (A2). $\int_{-\infty}^{0} h(x)\,dx = \frac{1}{2}$, $h(0) > 0$, and $|h(x) - h(0)| \le Cx^2$ in
an open neighborhood of 0.

The last condition $|h(x) - h(0)| \le Cx^2$ is basically equivalent to $h'(0) = 0$.
Assumption (A2) is satisfied when $h$ is symmetric and $h''$ exists in a
neighborhood of 0.

Theorem 2. Under Assumptions (A0) and (A2), the two experiments
$F_n^*$ and $G_n$ are asymptotically equivalent with respect to the set of procedures
$\Lambda_n$ and set of loss functions $\Gamma_n$.

These theorems together imply that, under Assumptions (A1) and (A2)
and the Hölder condition (5), the experiment $F_n$ is asymptotically equivalent
to $G_n$ with respect to the set of procedures $\Lambda_n$ and set of loss functions $\Gamma_n$.
So any statistical procedure $\delta$ in $G_n$ can be carried over to $E_n$ (by
treating $X_j$ as if it were $X_j^{**}$) in the sense that the new procedure has the
same asymptotic risk as $\delta$ for all loss functions bounded by a certain power
of $n$.

2.4. Discussion. The asymptotic equivalence theory provides deep in- 
sight and useful guidance for the construction of practical procedures in 
a broad range of statistical inference problems under the nonparametric 
regression model (1) with an unknown symmetric error distribution. Inter- 
esting problems include robust and adaptive estimation of the regression 
function, estimation of linear or quadratic functionals, construction of con- 
fidence sets, nonparametric hypothesis testing, etc. There is a large body of 
literature on these nonparametric problems in the case of Gaussian errors. 
With the asymptotic equivalence theory developed in this section, many 
of these procedures and results can be extended and robustified to deal 
with the case of an unknown symmetric error distribution. For example, 
the SureShrink procedure of Donoho and Johnstone (1995), the empirical 
Bayes procedures of Johnstone and Silverman (2005) and Zhang (2005), and 
SureBlock in Cai and Zhou (2009) can be carried over from the Gaussian 
regression to the general nonparametric regression. Theoretical properties 
such as rates of convergence remain the same under the regression model 
(1) with suitable regularity conditions. 

To illustrate the general ideas, we consider in the next two sections two 
important nonparametric problems under the model (1): adaptive estimation 
of the regression function $f$ and robust estimation of the quadratic functional
$Q(f) = \int f^2$. These examples show that for a given statistical problem it is
easy to turn the case of nonparametric regression with general symmetric 
errors into the one with Gaussian noise and construct highly robust and 
adaptive procedures. Other robust inference problems can be handled in a 
similar fashion. 

3. Robust wavelet regression. We consider in this section robust and 
adaptive estimation of the regression function $f$ under the model (1). Many
estimation procedures have been developed in the literature for the case where
the errors $\xi_i$ are assumed to be i.i.d. Gaussian. However, these procedures
are not readily applicable when the noise distribution is unknown. In fact 
direct application of the procedures designed for the Gaussian case can fail 
badly if the noise is in fact heavy-tailed. See, for example, Figure 1 in the 
Introduction. 

In this section we construct a robust procedure by following the general 
principles of the asymptotic equivalence theory developed in Section 2. The 
estimator is robust, adaptive, and easily implementable. In particular, its 
performance is not sensitive to the error distribution. 

3.1. Wavelet procedure for robust nonparametric regression. We begin 
with basic notation and definitions and then give a detailed description of 
our robust wavelet regression procedure. 

Let $\{\phi, \psi\}$ be a pair of father and mother wavelets. The functions $\phi$ and
$\psi$ are assumed to be compactly supported and $\int \phi = 1$. Dilation and
translation of $\phi$ and $\psi$ generate an orthonormal wavelet basis. For simplicity in
exposition, we work with periodized wavelet bases on [0, 1]. Let

$$\phi^p_{j,k}(t) = \sum_{l=-\infty}^{\infty} \phi_{j,k}(t - l), \qquad \psi^p_{j,k}(t) = \sum_{l=-\infty}^{\infty} \psi_{j,k}(t - l) \qquad \text{for } t \in [0, 1],$$

where $\phi_{j,k}(t) = 2^{j/2}\phi(2^j t - k)$ and $\psi_{j,k}(t) = 2^{j/2}\psi(2^j t - k)$. The collection
$\{\phi^p_{j_0,k}, k = 1, \ldots, 2^{j_0}; \psi^p_{j,k}, j \ge j_0 \ge 0, k = 1, \ldots, 2^j\}$ is then an orthonormal
basis of $L^2[0,1]$, provided the primary resolution level $j_0$ is large enough
to ensure that the support of the scaling functions and wavelets at level $j_0$
is not the whole of [0, 1]. The superscript "p" will be suppressed from the
notation for convenience. An orthonormal wavelet basis has an associated
orthogonal Discrete Wavelet Transform (DWT) which transforms sampled
data into the wavelet coefficients. See Daubechies (1992) and Strang (1992)
for further details on wavelets and discrete wavelet transform. A
square-integrable function $f$ on [0, 1] can be expanded into a wavelet series,

(8)    $$f(t) = \sum_{k=1}^{2^{j_0}} \tilde\theta_{j_0,k}\,\phi_{j_0,k}(t) + \sum_{j=j_0}^{\infty} \sum_{k=1}^{2^j} \theta_{j,k}\,\psi_{j,k}(t),$$

where $\tilde\theta_{j,k} = \langle f, \phi_{j,k} \rangle$, $\theta_{j,k} = \langle f, \psi_{j,k} \rangle$ are the wavelet coefficients of $f$.

We now describe the robust regression procedure in detail. Let the sample
$\{Y_i, i = 1, \ldots, n\}$ be given as in (1). Set $J = \lfloor \log_2 \frac{n}{\log^{1+b} n} \rfloor$ for some $b > 0$ and
let $T = 2^J$. We first group the observations $Y_i$ consecutively into $T$
equi-length bins and then take the median of each bin. Denote the medians
by $X = (X_1, \ldots, X_T)$. Apply the discrete wavelet transform to the binned
medians $X$ and let $U = T^{-1/2} W X$ be the empirical wavelet coefficients,
where $W$ is the discrete wavelet transformation matrix. Write

(9)    $$U = (\tilde y_{j_0,1}, \ldots, \tilde y_{j_0,2^{j_0}}, y_{j_0,1}, \ldots, y_{j_0,2^{j_0}}, \ldots, y_{J-1,1}, \ldots, y_{J-1,2^{J-1}})'.$$

Here $\tilde y_{j_0,k}$ are the gross structure terms at the lowest resolution level, and
$y_{j,k}$ ($j = j_0, \ldots, J-1$, $k = 1, \ldots, 2^j$) are empirical wavelet coefficients at level
$j$ which represent fine structure at scale $2^j$. Set

(10)    $$\sigma_n = \frac{1}{2h(0)\sqrt{n}}.$$

Then the empirical wavelet coefficients can be written as

(11)    $$y_{j,k} = \theta_{j,k} + \epsilon_{j,k} + \sigma_n z_{j,k} + \xi_{j,k},$$

where $\theta_{j,k}$ are the true wavelet coefficients of $f$, $\epsilon_{j,k}$ are "small"
deterministic approximation errors, $z_{j,k}$ are i.i.d. $N(0,1)$, and $\xi_{j,k}$ are some "small"
stochastic errors. The asymptotic equivalence theory given in Section 2
indicates that both $\epsilon_{j,k}$ and $\xi_{j,k}$ are "negligible" and the calculations in Section
6 will show this is indeed the case. If these negligible errors are ignored then
we have

(12)    $$y_{j,k} \approx \theta_{j,k} + \sigma_n z_{j,k} \qquad \text{with } z_{j,k} \overset{\text{i.i.d.}}{\sim} N(0, 1),$$

which is the idealized Gaussian sequence model.
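
The transform step, in which the binned medians are turned into empirical wavelet coefficients via $T^{-1/2} W X$, can be illustrated with the simplest orthonormal wavelet. The sketch below is an assumption-laden stand-in: it uses the Haar basis and decomposes all the way down to a single scaling coefficient, whereas the text allows any compactly supported wavelet pair and a primary level $j_0$. The function names `haar_dwt` and `empirical_coefficients` are hypothetical.

```python
import math


def haar_dwt(x):
    """Orthonormal Haar transform of a list whose length is a power of two.

    Returns [scaling coefficient, coarsest detail, ..., finest details].
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx, details = list(x), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Finer levels are computed first, so prepend to keep coarse-to-fine order.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


def empirical_coefficients(medians):
    """Rescale the transformed medians by T^{-1/2}, with W taken to be Haar."""
    scale = 1 / math.sqrt(len(medians))
    return [scale * c for c in haar_dwt(medians)]


c = haar_dwt([4.0, 2.0, 5.0, 5.0])            # orthogonal: energy is preserved
u = empirical_coefficients([3.0] * 8)         # constant medians with value 3
print(c, u)
```

With the $T^{-1/2}$ scaling, a constant signal of height 3 yields a scaling coefficient of 3 and zero detail coefficients, matching the inner product of the function with the Haar father wavelet; this is why the rescaled coefficients approximate the true wavelet coefficients of $f$.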

The BlockJS procedure introduced in Cai (1999) for Gaussian nonparametric
regression is then applied to $y_{j,k}$ as if they are exactly distributed as
in (12). More specifically, at each resolution level $j$, the empirical wavelet
coefficients $y_{j,k}$ are grouped into nonoverlapping blocks of length $L$. Let
$B_j^i = \{(j, k) : (i-1)L + 1 \le k \le iL\}$ and let $S^2_{j,i} = \sum_{(j,k) \in B_j^i} y^2_{j,k}$. Let $\hat\sigma_n^2$ be
an estimator of $\sigma_n^2$ [see (16) for an estimator]. Set $J_* = \lfloor \log_2 \frac{T}{\log^{1+b} n} \rfloor$. A
modified James-Stein shrinkage rule is then applied to each block $B_j^i$ with
$j < J_*$, that is,

(13)    $$\hat\theta_{j,k} = \left(1 - \frac{\lambda_* L \hat\sigma_n^2}{S^2_{j,i}}\right)_+ y_{j,k} \qquad \text{for } (j, k) \in B_j^i,$$

where $\lambda_* = 4.50524$ is a constant satisfying $\lambda_* - \log \lambda_* = 3$. For the gross
structure terms at the lowest resolution level $j_0$, we set $\hat{\tilde\theta}_{j_0,k} = \tilde y_{j_0,k}$. The
estimate of $f$ at the sample points $\{i/T : i = 1, \ldots, T\}$ is obtained by applying the
inverse discrete wavelet transform (IDWT) to the denoised wavelet
coefficients. That is, $\{f(i/T) : i = 1, \ldots, T\}$ is estimated by $\hat f = \{\hat f(i/T) : i = 1, \ldots, T\}$
with $\hat f = T^{1/2} W^{-1} \cdot \hat\theta$. The whole function $f$ is estimated by

(14)    $$\hat f_n(t) = \sum_{k=1}^{2^{j_0}} \hat{\tilde\theta}_{j_0,k}\,\phi_{j_0,k}(t) + \sum_{j=j_0}^{J_*-1} \sum_{k=1}^{2^j} \hat\theta_{j,k}\,\psi_{j,k}(t).$$
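
The blockwise shrinkage step can be sketched as follows for the coefficients of a single resolution level. This is a minimal illustration, not the authors' implementation: the function name `block_james_stein` is hypothetical, the noise variance is passed in by the caller rather than estimated, and a final block shorter than the block length is shrunk using its actual length, which is an assumption made here for convenience.

```python
import math

LAMBDA_STAR = 4.50524               # constant quoted in the text


def block_james_stein(coeffs, block_len, sigma2, lam=LAMBDA_STAR):
    """Shrink one level of coefficients block by block.

    Each block is multiplied by (1 - lam * L * sigma2 / S2)_+, where S2 is
    the sum of squared coefficients in the block and L its length.
    """
    shrunk = []
    for start in range(0, len(coeffs), block_len):
        block = coeffs[start:start + block_len]
        energy = sum(c * c for c in block)          # S^2 for this block
        threshold = lam * len(block) * sigma2
        # A block with zero energy carries no signal and is set to zero.
        factor = max(0.0, 1.0 - threshold / energy) if energy > 0 else 0.0
        shrunk.extend(factor * c for c in block)
    return shrunk


# One strong block and one block at the noise level (sigma2 = 1).
out = block_james_stein([10.0, 10.0, 0.5, -0.5], block_len=2, sigma2=1.0)
print(out, LAMBDA_STAR - math.log(LAMBDA_STAR))
```

In the example the first block has energy far above the threshold and is kept almost intact, while the second block falls below it and is removed entirely, so coefficients are retained or discarded in groups rather than one at a time.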



Remark 1. An estimator of $h^{-2}(0)$ can be given by

(15) ^ 2 (0) = ^E(^i-^ 2fc ) 2 
and the variance $\sigma_n^2$ is then estimated by

(16) a « = WW„ = ^ x ^- x * )2 - 



It is shown in Section 6 that the estimator $\hat\sigma_n^2$ is an accurate estimate of $\sigma_n^2$.



The robust estimator $\hat f_n$ constructed above is easy to implement. Figure
2 below illustrates the main steps of the procedure. As a comparison, we
also plot the estimate obtained by applying directly the BlockJS procedure
to the original noisy signal. It can be seen clearly that this wavelet
procedure does not perform well in the case of heavy-tailed noise. Other 
standard wavelet procedures have similar performance qualitatively. On the 
other hand, the BlockJS procedure performs very well on the medians of the 
binned data. 

3.2. Adaptivity and robustness of the procedure. The robust regression
procedure presented in Section 3.1 enjoys a high degree of adaptivity and
robustness. We consider the theoretical properties of the procedure over the
Besov spaces. For a given $r$-regular mother wavelet $\psi$ with $r > \alpha$
and a fixed primary resolution level $j_0$, the Besov sequence norm
$\|\cdot\|_{b^\alpha_{p,q}}$ of the wavelet coefficients of a function $f$ is
defined by

(17) $\|f\|_{b^\alpha_{p,q}} = \|\underline{\xi}_{j_0}\|_p + \left(\sum_{j=j_0}^{\infty} \left(2^{js}\|\underline{\theta}_j\|_p\right)^q\right)^{1/q},$

where $\underline{\xi}_{j_0}$ is the vector of the father wavelet coefficients
at the primary resolution level $j_0$, $\underline{\theta}_j$ is the vector of
the wavelet coefficients at level $j$, and
$s = \alpha + \frac{1}{2} - \frac{1}{p} > 0$. Note that the Besov function norm
of index $(\alpha, p, q)$ of a function $f$ is equivalent to the sequence norm
(17) of the wavelet coefficients of the function. See Meyer (1992), Triebel
(1992) and DeVore and Popov (1988) for further details on Besov spaces.

Fig. 2. Top left panel: noisy Spikes signal with sample size n = 4096 where the
noise has a t_2 distribution. Top right panel: the medians of the binned data
with the bin size m = 8. Middle left panel: the discrete wavelet coefficients of
the medians. Middle right panel: blockwise thresholded wavelet coefficients of
the medians. Bottom left panel: the robust estimate of the Spikes signal (dotted
line is the true signal). Bottom right panel: the estimate obtained by applying
the BlockJS procedure directly to the original noisy signal.

We define

(18) $B^\alpha_{p,q}(M) = \{f : \|f\|_{b^\alpha_{p,q}} \le M\}.$
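
The sequence norm (17) is straightforward to evaluate for a finite set of
coefficients. The sketch below is only an illustration: it assumes finite
$p, q \ge 1$, truncates the infinite sum at the finest level supplied, and uses
hypothetical names (`lp_norm`, `besov_sequence_norm`) and a list-of-lists layout
for the detail coefficients.

```python
def lp_norm(values, p):
    """Plain l_p norm of a finite sequence (finite p >= 1 assumed)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def besov_sequence_norm(father, details, j0, alpha, p, q):
    """Truncated version of the Besov sequence norm (17).

    father  : father wavelet coefficients at the primary level j0
    details : details[i] holds the wavelet coefficients at level j0 + i
    """
    s = alpha + 0.5 - 1.0 / p
    weighted = sum(
        (2.0 ** ((j0 + i) * s) * lp_norm(theta_j, p)) ** q
        for i, theta_j in enumerate(details)
    )
    return lp_norm(father, p) + weighted ** (1.0 / q)


if __name__ == "__main__":
    # One detail level at j = 1 with alpha = 1, p = 2 (so s = 1) and q = 1.
    print(besov_sequence_norm([3.0, 4.0], [[1.0, 0.0]], j0=1, alpha=1.0, p=2, q=1))
```

A function belongs to the ball $B^\alpha_{p,q}(M)$ in (18) when this quantity,
computed over all levels, is at most $M$.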

In the case of Gaussian noise the minimax rate of convergence for estimating
$f$ over the Besov body $B^\alpha_{p,q}(M)$ is $n^{-2\alpha/(1+2\alpha)}$. See
Donoho and Johnstone (1998).

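We shall consider the following collection of error distributions. For
$0 < \epsilon_1 < 1$ and $\epsilon_i > 0$, $i = 2, 3, 4$, let

(19) $\mathcal{H}_{\epsilon_1,\epsilon_2} = \left\{h : \int_{-\infty}^{0} h(x)\,dx = \frac{1}{2},\ \epsilon_1 \le h(0) \le \frac{1}{\epsilon_1},\ |h(x) - h(0)| \le \frac{x^2}{\epsilon_1} \text{ for all } |x| < \epsilon_2\right\}$

and define $\mathcal{H} = \mathcal{H}(\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4)$ by

(20) $\mathcal{H} = \left\{h \in \mathcal{H}_{\epsilon_1,\epsilon_2} : \int |x|^{\epsilon_3} h(x)\,dx < \epsilon_4,\ h(x) = h(-x),\ |h^{(3)}(x)| \le \epsilon_4 \text{ for } |x| \le \epsilon_3\right\}.$

The assumption $\int |x|^{\epsilon_3} h(x)\,dx < \epsilon_4$ guarantees that the
moments of the median of the binned data are well approximated by those of the
normal random variable. Note that this assumption is satisfied by a large
collection of distributions including the Cauchy distribution.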

The following theorem shows that our estimator achieves optimal global
adaptation for a wide range of Besov balls $B^\alpha_{p,q}(M)$ defined in (18)
and uniformly over the family of error distributions given in (20).



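Theorem 3. Suppose the wavelet $\psi$ is $r$-regular. Then the estimator
$\hat f_n$ defined in (14) satisfies, for $p \ge 2$, $\alpha \le r$ and
$\frac{2\alpha^2}{1+2\alpha} > \frac{1}{p}$,

    $\sup_{h\in\mathcal{H}}\ \sup_{f\in B^\alpha_{p,q}(M)} E\|\hat f_n - f\|_2^2 \le C n^{-2\alpha/(1+2\alpha)}$

and for $1 \le p < 2$, $\alpha \le r$ and
$\frac{2\alpha^2}{1+2\alpha} > \frac{1}{p}$,

    $\sup_{h\in\mathcal{H}}\ \sup_{f\in B^\alpha_{p,q}(M)} E\|\hat f_n - f\|_2^2 \le C n^{-2\alpha/(1+2\alpha)} (\log n)^{(2-p)/(p(1+2\alpha))}.$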

In addition to global adaptivity, the estimator also enjoys a high degree
of local spatial adaptivity. For a fixed point $t_0 \in (0,1)$ and
$0 < \alpha \le 1$, define the local Hölder class $\Lambda^\alpha(M, t_0, \delta)$
by

    $\Lambda^\alpha(M, t_0, \delta) = \{f : |f(t) - f(t_0)| \le M|t - t_0|^\alpha \text{ for } t \in (t_0 - \delta, t_0 + \delta)\}.$

If $\alpha > 1$, then

    $\Lambda^\alpha(M, t_0, \delta) = \{f : |f^{(\lfloor\alpha\rfloor)}(t) - f^{(\lfloor\alpha\rfloor)}(t_0)| \le M|t - t_0|^{\alpha'} \text{ for } t \in (t_0 - \delta, t_0 + \delta)\},$

where $\lfloor\alpha\rfloor$ is the largest integer less than $\alpha$ and
$\alpha' = \alpha - \lfloor\alpha\rfloor$.

In the Gaussian nonparametric regression setting, it is well known that the
optimal rate of convergence for estimating $f(t_0)$ over
$\Lambda^\alpha(M, t_0, \delta)$ with $\alpha$ completely known is
$n^{-2\alpha/(1+2\alpha)}$. On the other hand, when $\alpha$ is unknown,
Lepski (1990) and Brown and Low (1996a) showed that the local adaptive
minimax rate over the Hölder class $\Lambda^\alpha(M, t_0, \delta)$ is
$(\log n/n)^{2\alpha/(1+2\alpha)}$. So one has to pay at least a logarithmic
factor for adaptation.
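
To get a feel for the size of this logarithmic penalty, the two rates can be
compared numerically. The short sketch below is purely illustrative; the
function names are hypothetical and constants are ignored.

```python
import math


def known_smoothness_rate(n, alpha):
    """Pointwise minimax rate n^{-2a/(1+2a)} when alpha is known."""
    return n ** (-2.0 * alpha / (1.0 + 2.0 * alpha))


def adaptive_local_rate(n, alpha):
    """Local adaptive minimax rate (log n / n)^{2a/(1+2a)}."""
    return (math.log(n) / n) ** (2.0 * alpha / (1.0 + 2.0 * alpha))


if __name__ == "__main__":
    n, alpha = 4096, 1.0
    penalty = adaptive_local_rate(n, alpha) / known_smoothness_rate(n, alpha)
    # The ratio equals (log n)^{2a/(1+2a)}; it grows with n, but only slowly.
    print(penalty)
```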

Theorem 4 below shows that our estimator achieves optimal local adaptation
with the minimal cost uniformly over the family of noise distributions
defined in (20).

Theorem 4. Suppose the wavelet $\psi$ is $r$-regular with
$r \ge \alpha > 0$. Let $t_0 \in (0,1)$ be fixed. Then the estimator
$\hat f_n$ defined in (14) satisfies

(21) $\sup_{h\in\mathcal{H}}\ \sup_{f\in\Lambda^\alpha(M,t_0,\delta)} E(\hat f_n(t_0) - f(t_0))^2 \le C\cdot\left(\frac{\log n}{n}\right)^{2\alpha/(1+2\alpha)}.$

Remark 2. Note that in the general asymptotic equivalence theory
given in Section 2 the bin size was chosen to be $n^{1/3}\log n$. However, for
specific estimation problems such as robust estimation of $f$ discussed in this
section, the bin size can be chosen differently. Here we choose a small bin size
$\log^{1+b} n$. There is a significant advantage in choosing such a small bin
size in this problem. Note that the smoothness assumptions for $\alpha$ in
Theorems 3 and 4 are different from those in Theorems 3 and 4 in Brown, Cai and
Zhou (2008). For example, in Theorem 4 of Brown, Cai and Zhou (2008) it
was assumed that $\alpha > 1/6$, but now we need only $\alpha > 0$ due to the
choice of the small bin size.

4. Robust estimation of the quadratic functional $\int f^2$. An important
nonparametric estimation problem is that of estimating the quadratic
functional $Q(f) = \int f^2$. This problem is interesting in its own right and
closely related to the construction of confidence balls and nonparametric
hypothesis testing in nonparametric function estimation. See, for example, Li
(1989), Dümbgen (1998), Spokoiny (1998), Genovese and Wasserman (2005) and
Cai and Low (2006a). In addition, as shown in Bickel and Ritov (1988),
Donoho and Nussbaum (1990) and Fan (1991), this problem connects the
nonparametric and semiparametric literatures.

Estimating the quadratic functional $Q(f)$ has been well studied in the
Gaussian noise setting. See, for example, Donoho and Nussbaum (1990), Fan
(1991), Efromovich and Low (1996), Laurent and Massart (2000), Klemelä
(2006) and Cai and Low (2005, 2006b). In this section, we shall consider
robust estimation of the quadratic functional $Q(f)$ under the regression
model (1) with an unknown symmetric error distribution. We shall follow
the same notation used in Section 3. Note that the orthonormality of the
wavelet basis implies the isometry between the $L_2$ function norm and the
$\ell_2$ wavelet sequence norm, which yields

    $Q(f) = \int f^2 = \sum_{k=1}^{2^{j_0}} \tilde\theta_{j_0,k}^2 + \sum_{j=j_0}^{\infty}\sum_{k=1}^{2^j} \theta_{j,k}^2.$

The problem of estimating $Q(f)$ is then translated into estimating the
squared coefficients.

We consider adaptively estimating $Q(f)$ over Besov balls $B^\alpha_{p,q}(M)$
with $\alpha > \frac{1}{p} + \frac{1}{2}$. We shall show that it is in fact
possible to find a simple procedure which is asymptotically rate optimal
simultaneously over a large collection of unknown symmetric error
distributions. In this sense, the procedure is robust.

As in Section 3, we group the observations $Y_i$ into $T$ bins of size
$\log^{1+b}(n)$ for some $b > 0$ and then take the median of each bin. Let
$X = (X_1, \ldots, X_T)$ denote the binned medians and let
$U = T^{-1/2} W X$ be the empirical wavelet coefficients, where $W$ is the
discrete wavelet transformation matrix. Write $U$ as in (9). Then the empirical
wavelet coefficients can be approximately decomposed as in (12):

(22) $\tilde y_{j_0,k} \approx \tilde\theta_{j_0,k} + \sigma_n \tilde z_{j_0,k}$ and $y_{j,k} \approx \theta_{j,k} + \sigma_n z_{j,k},$

where $\sigma_n = 1/(2h(0)\sqrt{n})$ and $\tilde z_{j_0,k}$ and $z_{j,k}$ are
i.i.d. standard normal variables.
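
The binning step is simple to carry out in practice. The following Python
sketch is a minimal illustration under the assumption that the observations are
ordered by design point and that any incomplete final bin is dropped; the name
`bin_medians` is hypothetical.

```python
import statistics


def bin_medians(observations, bin_size):
    """Split the observations into consecutive bins and return each bin's median.

    Observations beyond the last complete bin are ignored (an assumption of
    this sketch, matching the case where n is divisible by T).
    """
    num_bins = len(observations) // bin_size
    return [
        statistics.median(observations[i * bin_size:(i + 1) * bin_size])
        for i in range(num_bins)
    ]


if __name__ == "__main__":
    # A single gross outlier (100) does not affect the median of its bin.
    print(bin_medians([1, 2, 100, 4, 5, 6], bin_size=3))
```

The medians $X_1, \ldots, X_T$ produced this way are then treated as
approximately Gaussian observations, as in (22).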

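The quadratic functional $Q(f)$ can then be estimated as if we had exactly
the idealized sequence model (22). More specifically, let
$J_q = \lfloor \log_2 \sqrt{n} \rfloor$ and set

(23) $\hat Q = \sum_{k=1}^{2^{j_0}} (\tilde y_{j_0,k}^2 - \hat\sigma_n^2) + \sum_{j=j_0}^{J_q}\sum_{k=1}^{2^j} (y_{j,k}^2 - \hat\sigma_n^2).$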
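
The estimator (23) is a bias-corrected sum of squared empirical coefficients.
A minimal Python sketch, assuming the coefficients have already been computed
and are passed in as plain lists (the function name and argument layout are
hypothetical), is:

```python
def quadratic_functional_estimate(coarse, details, sigma2):
    """Evaluate the estimator (23).

    coarse  : gross-structure coefficients at the primary level j0
    details : details[i] holds the coefficients y_{j,k} at level j0 + i,
              up to and including level J_q
    sigma2  : estimate of the noise variance sigma_n^2
    """
    total = sum(y * y - sigma2 for y in coarse)
    for level in details:
        # Each squared coefficient overestimates theta^2 by about sigma_n^2,
        # so that amount is subtracted term by term.
        total += sum(y * y - sigma2 for y in level)
    return total


if __name__ == "__main__":
    print(quadratic_functional_estimate([1.0, 2.0], [[3.0], [0.0, 0.0]], 0.25))
```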

The following theorem shows that this estimator is robust and rate- 
optimal for a large collection of symmetric error distributions and a wide 
range of Besov classes simultaneously. 

Theorem 5. For all Besov balls $B^\alpha_{p,q}(M)$ with
$\alpha > \frac{1}{p} + \frac{1}{2}$, the estimator $\hat Q$ given in (23)
satisfies

(24) $\sup_{f\in B^\alpha_{p,q}(M)} E_f(\hat Q - Q(f))^2 \le \frac{M^2}{h^2(0)}\, n^{-1}(1 + o(1)).$
Remark 3. We should note that there is a slight tradeoff between efficiency
and robustness. When the error distribution is known to be Gaussian,
it is possible to construct a simple procedure which is efficient,
asymptotically attaining the exact minimax risk $4M^2 n^{-1}$. See, for
example, Cai and Low (2005). In the Gaussian case, the upper bound in (24) is
$2\pi M^2 n^{-1}$, which is slightly larger than $4M^2 n^{-1}$. On the other
hand, our procedure is robust over a large collection of unknown symmetric
error distributions.
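
The constants in Remark 3 can be checked directly: for standard normal errors
$h(0) = 1/\sqrt{2\pi}$, so $M^2/h^2(0) = 2\pi M^2$. The following small sketch
(with a hypothetical helper name, and assuming unit-variance Gaussian noise)
computes the resulting ratio of the two risk constants.

```python
import math


def robust_risk_constant(h0, M=1.0):
    """Leading constant M^2 / h(0)^2 in the bound (24)."""
    return M ** 2 / h0 ** 2


if __name__ == "__main__":
    gaussian_h0 = 1.0 / math.sqrt(2.0 * math.pi)  # standard normal density at 0
    efficient_constant = 4.0                      # exact minimax constant, M = 1
    # Ratio of the robust bound to the efficient Gaussian risk: 2*pi / 4 = pi / 2.
    print(robust_risk_constant(gaussian_h0) / efficient_constant)
```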

The examples of adaptive and robust estimation of the regression function
and the quadratic functional given in this section and the previous one
illustrate the practical use of the general principles in the asymptotic
equivalence theory given in Section 2. It is easy to see that other
nonparametric inference problems such as the construction of confidence sets
and nonparametric hypothesis testing under the general nonparametric
regression model (1) can be handled in a similar way. Hence, our approach can
be viewed as a general method for robust nonparametric inference.

5. Technical tools: moderate deviation and quantile coupling for median. 

Quantile coupling is an important technical tool in probability and statistics.
For example, the celebrated KMT coupling results given in Komlós, Major
and Tusnády (1975) play a key role in the Hungarian construction in the
asymptotic equivalence theory. See, for example, Nussbaum (1996). Standard
coupling inequalities are mostly focused on the coupling of the mean of
i.i.d. random variables with a normal variable. Brown, Cai and Zhou (2008)
studied the coupling of a median statistic with a normal variable. For the 
asymptotic equivalence theory given in Section 2 and the proofs of the theo- 
retical results in Section 3 we need a more refined moderate deviation result 
for the median and an improved coupling inequality than those given in 
Brown, Cai and Zhou (2008). This improvement plays a crucial role in this 
paper. It is the main tool for reducing the problem of robust regression with 
unknown symmetric noise to a well studied and relatively simple problem of 
Gaussian regression. The result here may be of independent interest because 
of the fundamental role played by the median in statistics. 

Let $X$ be a random variable with distribution $G$, and $Y$ with a continuous
distribution $F$. Define

(25) $\tilde X = G^{-1}(F(Y)),$

where $G^{-1}(x) = \inf\{u : G(u) \ge x\}$; then
$\mathcal{L}(\tilde X) = \mathcal{L}(X)$. Note that $\tilde X$ and $Y$ are now
defined on the same probability space. This makes it possible to give a
pointwise bound between $\tilde X$ and $Y$. For example, one can couple
Binomial$(m, 1/2)$ and $N(m/2, m/4)$ distributions. Let
$X = 2(W - m/2)/\sqrt{m}$ with $W \sim$ Binomial$(m, 1/2)$ and
$Y \sim N(0,1)$, and let $\tilde X(Y)$ be defined as in (25). Komlós, Major and
Tusnády (1975) showed that for some constants $C > 0$ and $\varepsilon > 0$,
when $|\tilde X| \le \varepsilon\sqrt{m}$,

(26) $|\tilde X - Y| \le \frac{C}{\sqrt{m}} + \frac{C}{\sqrt{m}}|\tilde X|^2.$
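
The construction (25) can be made concrete for the binomial example. The sketch
below is an illustration only: it computes $\tilde X = G^{-1}(\Phi(Y))$ for a
given value of $Y$ using exact integer binomial counts. The function names are
hypothetical, and $m$ is assumed to be at most about 1000 so that $2^m$ is
representable as a float.

```python
import math
from statistics import NormalDist


def binomial_quantile(u, m):
    """Smallest w with P(W <= w) >= u for W ~ Binomial(m, 1/2)."""
    target = u * 2 ** m  # compare cumulative counts against u * 2^m
    cumulative = 0
    for w in range(m + 1):
        cumulative += math.comb(m, w)
        if cumulative >= target:
            return w
    return m


def coupled_binomial(y, m):
    """Quantile coupling (25): standardized binomial value paired with Y = y."""
    u = NormalDist().cdf(y)                    # F(Y) with F standard normal
    w = binomial_quantile(u, m)                # G^{-1}(F(Y)) on the count scale
    return 2.0 * (w - m / 2.0) / math.sqrt(m)  # standardize as in the text


if __name__ == "__main__":
    for y in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(y, coupled_binomial(y, m=100))
```

Because $\tilde X$ is a nondecreasing function of $Y$, the two variables move
together, which is what makes a pointwise bound such as (26) possible.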



Let $\xi_1, \ldots, \xi_m$ be i.i.d. random variables with density function
$h$. Denote the sample median by $\xi_{\mathrm{med}}$. The classical theory
shows that the limiting distribution of $2h(0)\sqrt{m}\,\xi_{\mathrm{med}}$ is
$N(0,1)$. We will construct a new random variable $\tilde\xi_{\mathrm{med}}$ by
using quantile coupling in (25) such that
$\mathcal{L}(\tilde\xi_{\mathrm{med}}) = \mathcal{L}(\xi_{\mathrm{med}})$ and
show that $\tilde\xi_{\mathrm{med}}$ can be well approximated by a normal random
variable as in (26). Denote the distribution and density function of the sample
median $\xi_{\mathrm{med}}$ by $G$ and $g$, respectively. We obtain an improved
approximation of the density $g$ by a normal density which leads to a better
moderate deviation result for the distribution of the sample median and
consequently improves the classical KMT bound from the rate $1/\sqrt{m}$ to
$1/m$. A general theory for improving the classical quantile coupling bound was
given in Zhou (2006).

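Theorem 6. Let $Z \sim N(0,1)$ and let $\xi_1, \ldots, \xi_m$ be i.i.d. with
density function $h$, where $m = 2k+1$ for some integer $k \ge 1$. Let
Assumption (A2) hold. Then, for $|x| \le \varepsilon$,

(27) $g(x) = \frac{\sqrt{8k}\,h(0)}{\sqrt{2\pi}} \exp\left(-8kh^2(0)x^2/2 + O(kx^4 + k^{-1})\right)$

and for $0 < x < \varepsilon$,

(28) $G(-x) = \Phi(-\sqrt{8k}\,h(0)x)\exp(O(kx^4 + k^{-1}))$ and $\bar G(x) = \bar\Phi(\sqrt{8k}\,h(0)x)\exp(O(kx^4 + k^{-1})),$

where $\bar G(x) = 1 - G(x)$ and $\bar\Phi(x) = 1 - \Phi(x)$. Consequently, for
every $m$, there is a mapping
$\tilde\xi_{\mathrm{med}}(Z) : \mathbb{R} \mapsto \mathbb{R}$ such that
$\mathcal{L}(\tilde\xi_{\mathrm{med}}(Z)) = \mathcal{L}(\xi_{\mathrm{med}})$ and

(29) $|2h(0)\sqrt{m}\,\tilde\xi_{\mathrm{med}} - Z| \le \frac{C}{m} + \frac{C}{m}\,|2h(0)\sqrt{m}\,\tilde\xi_{\mathrm{med}}|^3$, when $|\tilde\xi_{\mathrm{med}}| \le \varepsilon$

and

(30) $|2h(0)\sqrt{m}\,\tilde\xi_{\mathrm{med}} - Z| \le \frac{C}{m}(1 + |Z|^3)$, when $|Z| \le \varepsilon\sqrt{m}$,

where $C, \varepsilon > 0$ depend on $h$ but not on $m$.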
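
The density approximation (27) can be checked numerically in a simple case. The
sketch below takes $h$ to be the standard normal density (an assumption made
only for this example) and compares the exact density of the median of
$m = 2k+1$ observations with the leading normal term in (27); the function
names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def exact_median_density(x, k, dist=STD_NORMAL):
    """Exact density of the sample median of 2k+1 i.i.d. draws from dist."""
    H, h = dist.cdf(x), dist.pdf(x)
    # log of (2k+1)! / (k!)^2, computed via lgamma for numerical stability
    log_const = math.lgamma(2 * k + 2) - 2.0 * math.lgamma(k + 1)
    return math.exp(log_const + k * math.log(H) + k * math.log(1.0 - H)) * h


def normal_approx_density(x, k, h0):
    """Leading term of (27): sqrt(8k) h(0) / sqrt(2 pi) * exp(-8k h(0)^2 x^2 / 2)."""
    return (math.sqrt(8 * k) * h0 / math.sqrt(2.0 * math.pi)
            * math.exp(-8 * k * h0 ** 2 * x ** 2 / 2.0))


if __name__ == "__main__":
    k, h0 = 50, STD_NORMAL.pdf(0.0)
    for x in (0.0, 0.05, 0.1):
        ratio = exact_median_density(x, k) / normal_approx_density(x, k, h0)
        # The ratio corresponds to exp(O(k x^4 + 1/k)) and stays close to 1.
        print(x, ratio)
```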

Remark 4. In Brown, Cai and Zhou (2008), the density $g$ of the sample
median was approximated by a normal density as

    $g(x) = \frac{\sqrt{8k}\,h(0)}{\sqrt{2\pi}} \exp\left(-8kh^2(0)x^2/2 + O(k|x|^3 + |x| + k^{-1})\right)$ for $|x| \le \varepsilon$.

Since $\xi_{\mathrm{med}} = O_p(1/\sqrt{m})$, the approximation error
$O(k|x|^3 + |x| + k^{-1})$ is at the level of $1/\sqrt{m}$. In comparison, the
approximation error $O(kx^4 + k^{-1})$ in (27) is at the level of $1/m$. This
improvement is necessary for establishing (36) in the proof of Theorem 2, and
leads to an improved quantile coupling bound (30) over the bound obtained in
Brown, Cai and Zhou (2008):

    $|2h(0)\sqrt{m}\,\tilde\xi_{\mathrm{med}} - Z| \le \frac{C}{\sqrt{m}} + \frac{C}{\sqrt{m}} Z^2$, when $|\tilde\xi_{\mathrm{med}}| \le \varepsilon$.

Since $Z$ is at a constant level, we improve the bound from the classical rate
$1/\sqrt{m}$ to $1/m$.

Although the result is only given for odd $m$, it can be easily extended to
the even case as discussed in Remark 1 of Brown, Cai and Zhou (2008). The
coupling result given in Theorem 6 in fact holds uniformly for the whole
family of $h \in \mathcal{H}_{\epsilon_1,\epsilon_2}$.

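Theorem 7. Let $\xi_1, \ldots, \xi_m$ be i.i.d. with density
$h \in \mathcal{H}_{\epsilon_1,\epsilon_2}$ in (19). For every $m = 2k+1$ with
integer $k \ge 1$, there is a mapping
$\tilde\xi_{\mathrm{med}}(Z) : \mathbb{R} \mapsto \mathbb{R}$ such that
$\mathcal{L}(\tilde\xi_{\mathrm{med}}(Z)) = \mathcal{L}(\xi_{\mathrm{med}})$ and
for two constants $C_{\epsilon_1,\epsilon_2}, \varepsilon_{\epsilon_1,\epsilon_2} > 0$
depending only on $\epsilon_1$ and $\epsilon_2$,

(31) $|2h(0)\sqrt{m}\,\tilde\xi_{\mathrm{med}} - Z| \le \frac{C_{\epsilon_1,\epsilon_2}}{m} + \frac{C_{\epsilon_1,\epsilon_2}}{m}\,|2h(0)\sqrt{m}\,\tilde\xi_{\mathrm{med}}|^3$, when $|\tilde\xi_{\mathrm{med}}| \le \varepsilon_{\epsilon_1,\epsilon_2}$

and

    $|2h(0)\sqrt{m}\,\tilde\xi_{\mathrm{med}} - Z| \le \frac{C_{\epsilon_1,\epsilon_2}}{m} + \frac{C_{\epsilon_1,\epsilon_2}}{m}\,|Z|^3$, when $|Z| \le \varepsilon_{\epsilon_1,\epsilon_2}\sqrt{m}$,

uniformly over all $h \in \mathcal{H}_{\epsilon_1,\epsilon_2}$.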

6. Proofs. We shall prove the main results in the order of Theorems 
6 and 7, Theorems 1 and 2, Theorem 3, and then Theorem 5. Theorems 
6 and 7 provide important technical tools for the proof of the rest of the 
theorems. For reasons of space, we omit the proof of Theorem 4 and some 
of the technical lemmas. See Cai and Zhou (2008) for the complete proofs. 

In this section, $C$ denotes a positive constant not depending on $n$ that
may vary from place to place and we set $d = \min(\alpha - \frac{1}{p}, 1)$.



6.1. Proofs of Theorems 6 and 7. We only prove (27) and (28). It follows
from Zhou (2006) that the moderate deviation bound (28) implies the
coupling bounds (29) and (30). Let $H(x)$ be the distribution function of
$\xi_1$. The density of the median is

    $g(x) = \frac{(2k+1)!}{(k!)^2}\, H^k(x)(1 - H(x))^k h(x).$

Stirling's formula, $j! = \sqrt{2\pi}\, j^{j+1/2}\exp(-j + \epsilon_j)$ with
$\epsilon_j = O(1/j)$, gives

    $g(x) = \frac{(2k+1)!}{4^k (k!)^2}\, [4H(x)(1 - H(x))]^k h(x)$
    $\phantom{g(x)} = \frac{2\sqrt{2k+1}}{e\sqrt{2\pi}} \left(\frac{2k+1}{2k}\right)^{2k+1} [4H(x)(1 - H(x))]^k h(x) \exp\left(O\left(\frac{1}{k}\right)\right).$

It is easy to see that $|\sqrt{2k+1}/\sqrt{2k} - 1| \le k^{-1}$, and

    $\left(\frac{2k+1}{2k}\right)^{2k+1} = \exp\left(-(2k+1)\log\left(1 - \frac{1}{2k+1}\right)\right) = \exp\left(1 + O\left(\frac{1}{k}\right)\right).$

Then we have, when $0 < H(x) < 1$,

(32) $g(x) = \frac{\sqrt{8k}}{\sqrt{2\pi}}\, [4H(x)(1 - H(x))]^k h(x) \exp\left(O\left(\frac{1}{k}\right)\right).$

From the assumption in the theorem, Taylor's expansion gives

    $4H(x)(1 - H(x)) = 1 - 4(H(x) - H(0))^2$
    $\phantom{4H(x)(1 - H(x))} = 1 - 4\left[\int_0^x (h(t) - h(0))\,dt + h(0)x\right]^2$
    $\phantom{4H(x)(1 - H(x))} = 1 - 4(h(0)x + O(|x|^3))^2$

for $0 \le |x| \le \varepsilon$, that is,
$\log(4H(x)(1 - H(x))) = -4h^2(0)x^2 + O(x^4)$ when $|x| \le 2\varepsilon$ for
some $\varepsilon > 0$. Here $\varepsilon$ is chosen sufficiently small so that
$h(x) > 0$ for $|x| \le 2\varepsilon$. Assumption (A2) also implies

    $\frac{h(x)}{h(0)} = 1 + O(x^2) = \exp(O(x^2))$ for $|x| \le 2\varepsilon$.

Thus, for $|x| \le 2\varepsilon$,

    $g(x) = \frac{\sqrt{8k}\,h(0)}{\sqrt{2\pi}} \exp(-8kh^2(0)x^2/2 + O(kx^4 + x^2 + k^{-1}))$
    $\phantom{g(x)} = \frac{\sqrt{8k}\,h(0)}{\sqrt{2\pi}} \exp(-8kh^2(0)x^2/2 + O(kx^4 + k^{-1})).$

Now we approximate the distribution function of $\xi_{\mathrm{med}}$ by a normal
distribution. Without loss of generality, we assume $h(0) = 1$. We write

    $g(x) = \frac{\sqrt{8k}}{\sqrt{2\pi}} \exp(-8kx^2/2 + O(kx^4 + k^{-1}))$ for $|x| \le 2\varepsilon$.

Now we use this approximation of density functions to give the desired
approximation of distribution functions. Specifically, we shall show

(33) $G(x) = \int_{-\infty}^{x} g(t)\,dt \le \Phi(\sqrt{8k}\,x)\exp(C(kx^4 + k^{-1}))$

and

(34) $G(x) \ge \Phi(\sqrt{8k}\,x)\exp(-C(kx^4 + k^{-1}))$

for all $-\varepsilon \le x \le 0$ and some $C > 0$. The proof for
$0 \le x \le \varepsilon$ is similar. We now prove inequality (33). Note that

(35) $\left(\Phi(\sqrt{8k}\,x)\exp(C(kx^4 + k^{-1}))\right)' = \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\exp(C(kx^4 + k^{-1})) + \Phi(\sqrt{8k}\,x)\,4kCx^3 \exp(C(kx^4 + k^{-1})).$


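From Mill's ratio inequality, we have
$\Phi(\sqrt{8k}\,x)(-\sqrt{8k}\,x) < \varphi(\sqrt{8k}\,x)$ and hence

    $\Phi(\sqrt{8k}\,x)(4Ckx^3)\exp(C(kx^4 + k^{-1})) \ge \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\left(-\frac{C}{2}x^2\right)\exp(C(kx^4 + k^{-1})).$

This and (35) yield

    $\left(\Phi(\sqrt{8k}\,x)\exp(C(kx^4 + k^{-1}))\right)'$
    $\quad \ge \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\left(1 - \frac{C}{2}x^2\right)\exp(C(kx^4 + k^{-1}))$
    $\quad \ge \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\exp(-Cx^2)\exp(C(kx^4 + k^{-1}))$
    $\quad \ge \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\exp(C(kx^4 + k^{-1})/4).$

Here in the second inequality we apply $1 - t/2 \ge \exp(-t)$ when
$0 < t < 1/2$, and in the third we use $x^2 \le (kx^4 + k^{-1})/2$. Thus we have

    $\left(\Phi(\sqrt{8k}\,x)\exp(C(kx^4 + k^{-1}))\right)' \ge \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\exp(O(kx^4 + k^{-1})) = g(x)$

for $C$ sufficiently large and for $-2\varepsilon \le x \le 0$. Then

    $\int_{-2\varepsilon}^{x} g(t)\,dt \le \int_{-2\varepsilon}^{x} \left(\Phi(\sqrt{8k}\,t)\exp(C(kt^4 + k^{-1}))\right)' dt$
    $\quad = \Phi(\sqrt{8k}\,x)\exp(C(kx^4 + k^{-1})) - \Phi(-\sqrt{8k}\cdot 2\varepsilon)\exp(C(k(2\varepsilon)^4 + k^{-1}))$
    $\quad \le \Phi(\sqrt{8k}\,x)\exp(C(kx^4 + k^{-1})).$

From (32) we see

    $\int_{-\infty}^{-2\varepsilon} g(t)\,dt = \int_{-\infty}^{-2\varepsilon} \frac{(2k+1)!}{(k!)^2}\, H^k(t)(1 - H(t))^k h(t)\,dt$
    $\quad = \int_{0}^{H(-2\varepsilon)} \frac{(2k+1)!}{(k!)^2}\, u^k(1 - u)^k\,du$
    $\quad = o(k^{-1}) \int_{H(-3\varepsilon/2)}^{H(-\varepsilon)} \frac{(2k+1)!}{(k!)^2}\, u^k(1 - u)^k\,du$
    $\quad = o(k^{-1}) \int_{-3\varepsilon/2}^{-\varepsilon} g(t)\,dt,$

where the third equality is a result of the fact that
$u_1^k(1 - u_1)^k = o(k^{-1})\,u_2^k(1 - u_2)^k$ uniformly for
$u_1 \in [0, H(-2\varepsilon)]$ and
$u_2 \in [H(-3\varepsilon/2), H(-\varepsilon)]$. Thus we have

    $G(x) \le \Phi(\sqrt{8k}\,x)\exp(Ckx^4 + Ck^{-1}),$

which is (33). Equation (34) can be established in a similar way.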

Remark. Note that in the proof of Theorem 6 it can be seen easily that the
constants $C$ and $\varepsilon$ in (29) depend only on the range of $h(0)$ and
the bound on the Lipschitz constants of $h$ in a fixed open neighborhood of 0.
Theorem 7 then follows from the proof of Theorem 6 together with this
observation.

6.2. Proofs of Theorems 1 and 2. 

Proof of Theorem 1. Let $\epsilon_n$ be a sequence approaching 0 slowly,
for example, $\epsilon_n = 1/\log\log n$. Let $p_{f,n}$ be the joint density of
the $Y_i$'s and $p_{f^*,n}$ be the joint density of the $Y_i^*$'s. And let
$P_{f,n}$ be the joint distribution of the $Y_i$'s and $P_{f^*,n}$ be the joint
distribution of the $Y_i^*$'s. We want to show that

    $\max\{P_{f^*,n}(|1 - p_{f^*,n}/p_{f,n}| \ge \epsilon_n),\ P_{f,n}(|1 - p_{f,n}/p_{f^*,n}| \ge \epsilon_n)\}$

decays exponentially fast uniformly over the function space.

Note that
$P_{f^*,n}(|1 - p_{f^*,n}/p_{f,n}| \ge \epsilon_n) = P_{0,n}(|1 - p_{0,n}/p_{f^*-f,n}| \ge \epsilon_n)$.
It suffices to show that
$P_{0,n}(|\log(p_{f^*-f,n}/p_{0,n})| \ge \epsilon_n)$ decays exponentially fast.
Write

    $\log(p_{f^*-f,n}/p_{0,n}) = \sum_{i=1}^{n} \log\frac{h(\xi_i - a_i)}{h(\xi_i)}$

with $a_i = f^*(i/n) - f(i/n)$, where $\xi_i$ has density $h(x)$. Under
Assumption (A1), we have $E r_{a_i}(\xi_i) \le Ca_i^2$ and
$E\exp[t(r_{a_i}(\xi_i) - \mu(a_i))] \le \exp(Ct^2a_i^2)$, which imply

    $P_{0,n}\left(\exp\left[t\sum_{i=1}^{n} r_{a_i}(\xi_i)\right] \ge \exp(t\epsilon_n)\right) \le \exp\left(Ct^2\sum_{i=1}^{n} a_i^2 - t\epsilon_n\right).$

Since

    $\sum_{i=1}^{n} a_i^2 \le C\,n\left(\frac{n^{2/3}}{\log n}\right)^{-2d} = Cn^{1-4d/3}\log^{2d} n,$

which goes to zero for $d > 3/4$, by setting $t = n^{(4d/3-1)/2}$ the Markov
inequality implies that
$P_{0,n}(|\log(p_{f^*-f,n}/p_{0,n})| \ge \epsilon_n)$ decays exponentially
fast. □

Proof of Theorem 2. Let $g_{f,n}$ be the joint density of the $X_j^*$'s and
$q_{f,n}$ be the joint density of the $X_j^{**}$'s. And let $G_{0,n}$ be the
joint distribution of the $\eta_j$'s and $Q_{0,n}$ be the joint distribution of
the $Z_j$'s. Theorem 6 yields

    $g(x) = \frac{\sqrt{4m}\,h(0)}{\sqrt{2\pi}} \exp(-4mh^2(0)x^2/2 + O(mx^4 + m^{-1}))$

for $|x| \le m^{-1/3}$. Since $G_{0,n}(|\eta_j| > m^{-1/3})$ and
$Q_{0,n}(|Z_j| > m^{-1/3})$ decay exponentially fast, it suffices to study

    $\sum_{j=1}^{T} \log\frac{g(Z_j)}{\varphi_{\sigma_m}(Z_j)}\, I(|Z_j| \le m^{-1/3}).$

Let

    $l(Z_j) = \log\frac{g(Z_j)}{\varphi_{\sigma_m}(Z_j)}\, I(|Z_j| \le m^{-1/3})$

with $Z_j$ normally distributed with density $\varphi_{\sigma_m}(x)$. It can be
easily shown that

    $E\,l(Z_j) \le C\,Q_{0,n}\left(1 - \frac{g(Z_j)}{\varphi_{\sigma_m}(Z_j)}\right)^2 \le C_1 m^{-2}$

and

    $\mathrm{Var}(l(Z_j)) \le Cm^{-2}.$

Since $|Z_j| \le m^{-1/3}$, we have $|l(Z_j)| \le Cm^{-1/3}$. Taylor's expansion
gives

    $E\exp[t(l(Z_j) - E\,l(Z_j))] \le \exp(Ct^2m^{-2})$

for $t = \log^{3/2} n$; then, similarly to the proof of Theorem 1, we have

(36) $Q_{f,n}(|\log(g_{f,n}/q_{f,n})| \ge \epsilon_n) \le \exp(Ct^2Tm^{-2} - t\epsilon_n).$

Since $Tm^{-2} = 1/\log^3 n \to 0$, it decays faster than any polynomial of
$n$. □
6.3. Proof of Theorems 3 and 4. In the proofs of Theorems 3, 4 and 5,
we shall replace $\hat\sigma_n^2$ by $\sigma_n^2$. We assume that $h(0)$ is
known and equal to 1 without loss of generality, since it can be shown easily
that the estimator $\hat h(0)$ given in (15) satisfies

(37) $P\{|\hat h^{-2}(0) - h^{-2}(0)| > n^{-\delta}\} \le c_l\, n^{-l}$

for some $\delta > 0$ and all constants $l \ge 1$. Note that
$E\frac{8m}{T}\sum_k (X_{2k-1} - X_{2k})^2 = h^{-2}(0) + O(\sqrt{m}\,T^{-d})$,
and it is easy to show

    $E\left|\frac{8m}{T}\sum_k (X_{2k-1} - X_{2k})^2 - h^{-2}(0)\right|^l \le C_l\,(\sqrt{m}\,T^{-d})^l,$

where $\sqrt{m}\,T^{-d} = n^{-\delta}$ with $\delta > 0$ in our assumption. Then
(37) holds by Chebyshev's inequality. It is very important to see that the
asymptotic risk properties of our estimators for $f$ in (13) and $Q(f)$ in (23)
do not change when replacing $\sigma_n^2$ by $\sigma_n^2(1 + O(n^{-\delta}))$;
thus in our analysis we may just assume that $h(0)$ is known without loss of
generality.

For simplicity, we shall assume that $n$ is divisible by $T$ in the proof. The
coupling inequality and the fact that a Besov ball $B^\alpha_{p,q}(M)$ can be
embedded into a Hölder ball with smoothness
$d = \min(\alpha - \frac{1}{p}, 1) > 0$ [see Meyer (1992)] enable us to control
the errors precisely. Proposition 2 gives the bounds for both the deterministic
and stochastic errors.



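Proposition 2. Let $X_j$ be given as in our procedure and let
$f \in B^\alpha_{p,q}(M)$. Then $X_j$ can be written as

(38) $\sqrt{m}\,X_j = \sqrt{m}\,f\left(\frac{j}{T}\right) + \frac{1}{2}Z_j + \epsilon_j + \zeta_j,$

where:

(i) $Z_j \overset{\text{i.i.d.}}{\sim} N(0, \frac{1}{h^2(0)})$;

(ii) $\epsilon_j$ are constants satisfying $|\epsilon_j| \le C\sqrt{m}\,T^{-d}$
and so $\frac{1}{n}\sum_{j=1}^{T} \epsilon_j^2 \le CT^{-2d}$;

(iii) $\zeta_j$ are independent and "stochastically small" random variables
with $E\zeta_j = 0$, and can be written as

    $\zeta_j = \zeta_{j1} + \zeta_{j2} + \zeta_{j3}$

with

    $E\zeta_{j2} = 0$ and $|\zeta_{j2}| \le \frac{C}{m}(1 + |Z_j|^3),$

    $P(\zeta_{j3} = 0) \ge 1 - C\exp(-\varepsilon m)$ and $E|\zeta_{j3}|^D$ exists

for some $\varepsilon > 0$ and $C > 0$, and all $D > 0$.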
Remark 5. Equation (38) is different from Proposition 1 in Brown, Cai
and Zhou (2008), where there is an additional bias term $\sqrt{m}\,b_m$. Lemma 5
in Brown, Cai and Zhou (2008) showed that the bias $b_m$ can be estimated
with a rate $\max\{T^{-2d}, m^{-4}\}$. Therefore in that paper we need to choose
the bin size $m = n^{1/4}$ such that $m^{-4} = o(n^{-2\alpha/(2\alpha+1)})$ is
negligible relative to the minimax risk. In the present paper we can choose
$m = \log^{1+b} n$ because there is no bias term and as a result the condition
on the smoothness is relaxed.



The proof of Proposition 2 is similar to that of Proposition 1 in Brown, 
Cai and Zhou (2008) and is thus omitted here. See Cai and Zhou (2008) for 
a complete proof. 

We now consider the wavelet transform of the medians of the binned data.
From Proposition 2 we may write

    $\frac{1}{\sqrt{T}}X_i = \frac{f(i/T)}{\sqrt{T}} + \frac{\epsilon_i}{\sqrt{n}} + \frac{Z_i}{2\sqrt{n}} + \frac{\zeta_i}{\sqrt{n}}.$

Let $(y_{j,k}) = T^{-1/2}\,W \cdot X$ be the discrete wavelet transform of the
binned data. Then one may write

(39) $y_{j,k} = \theta'_{j,k} + \epsilon_{j,k} + \frac{1}{2\sqrt{n}}\,z_{j,k} + \xi_{j,k},$

where $\theta'_{j,k}$ are the discrete wavelet transform of
$(f(\frac{i}{T}))_{1 \le i \le T}$, $z_{j,k}$ are the transform of the $Z_i$'s
and so are i.i.d. $N(0,1)$, and $\epsilon_{j,k}$ and $\xi_{j,k}$ are,
respectively, the transforms of $(\frac{\epsilon_i}{\sqrt{n}})$ and
$(\frac{\zeta_i}{\sqrt{n}})$. The following proposition gives the risk bounds of
the block thresholding estimator in a single block. These risk bounds are
similar to results for the Gaussian case given in Cai (1999). But in the current
setting the error terms $\epsilon_{j,k}$ and $\xi_{j,k}$ make the problem more
complicated.



Proposition 3. Let $y_{j,k}$ be given as in (39) and let the block
thresholding estimator $\hat\theta_{j,k}$ be defined as in (13). Then:

(i) for some constant $C > 0$,

(40) $E\sum_{(j,k)\in B_j^i} (\hat\theta_{j,k} - \theta'_{j,k})^2 \le \min\left\{4\sum_{(j,k)\in B_j^i} (\theta'_{j,k})^2,\ 8\lambda_* Ln^{-1}\right\} + 6\sum_{(j,k)\in B_j^i} \epsilon_{j,k}^2 + CLn^{-2};$

(ii) for any $0 < \tau < 1$, there exists a constant $C_\tau > 0$ depending on
$\tau$ only such that for all $(j,k) \in B_j^i$

(41) $E(\hat\theta_{j,k} - \theta'_{j,k})^2 \le C_\tau \cdot \min\left\{\max_{(j,k)\in B_j^i}\{(\theta'_{j,k} + \epsilon_{j,k})^2\},\ Ln^{-1}\right\} + n^{-2+\tau};$

(iii) for $j \le J_*$ and $\epsilon_n > 1/\log n$,
$P(\sqrt{n}\,|\xi_{j,k}| > \epsilon_n) \le C\exp(-\epsilon_n m)$.

The third part follows from Lemma 3 in Cai and Wang (2008) which gives 
a concentration inequality for wavelet coefficients at a given resolution. 

For reasons of space we omit the proof of Proposition 3 here. See Cai 
and Zhou (2008) for a complete proof. We also need the following lemmas 
for the proof of Theorems 3 and 4. The proof of these lemmas is relatively 
straightforward and is thus omitted. 



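Lemma 1. Suppose $y_i = \theta_i + z_i$, $i = 1, \ldots, L$, where $\theta_i$
are constants and $z_i$ are random variables. Let $S^2 = \sum_{i=1}^{L} y_i^2$
and let $\hat\theta_i = (1 - \frac{\lambda L}{S^2})_+\, y_i$. Then

(42) $E\|\hat\theta - \theta\|_2^2 \le \|\theta\|_2^2 \wedge 4\lambda L + 4E[\|z\|_2^2\, I(\|z\|_2^2 > \lambda L)].$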



Lemma 2. Let $X \sim \chi^2_L$ and $\lambda > 1$. Then

(43) $P(X \ge \lambda L) \le e^{-(L/2)(\lambda - \log\lambda - 1)}$ and $E\,X I(X \ge \lambda L) \le \lambda L\, e^{-(L/2)(\lambda - \log\lambda - 1)}.$
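
The tail bound in Lemma 2 also shows why the constant $\lambda_*$ is convenient:
since $\lambda_* - \log\lambda_* = 3$, the exponent becomes exactly $-L$. The
sketch below compares the bound with the exact chi-square tail, using the
closed form available for an even number of degrees of freedom; the function
names are hypothetical and the restriction to even $L$ is an assumption of this
example only.

```python
import math


def chi2_upper_tail_even(dof, x):
    """Exact P(X >= x) for X ~ chi-square with an even number dof of degrees of freedom."""
    half, t = dof // 2, x / 2.0
    # For dof = 2r the tail equals exp(-t) * sum_{i < r} t^i / i!  (Poisson sum).
    return math.exp(-t) * sum(t ** i / math.factorial(i) for i in range(half))


def lemma2_tail_bound(L, lam):
    """Right-hand side of the first inequality in (43)."""
    return math.exp(-(L / 2.0) * (lam - math.log(lam) - 1.0))


if __name__ == "__main__":
    lam_star = 4.50524
    for L in (2, 4, 8, 16):
        exact = chi2_upper_tail_even(L, lam_star * L)
        bound = lemma2_tail_bound(L, lam_star)  # approximately exp(-L)
        print(L, exact, bound)
```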

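Lemma 3. Let $T = 2^J$ and let
$f_J(x) = \sum_{k=1}^{T} \frac{1}{\sqrt{T}}\, f(\frac{k}{T})\,\phi_{J,k}(x)$. Then

    $\sup_{f\in B^\alpha_{p,q}(M)} \|f_J - f\|_2^2 \le CT^{-2d}$ where $d = \min(\alpha - 1/p, 1)$.

Let $\{\theta'_{j,k}\}$ be the discrete wavelet transform of
$\{f(\frac{i}{T}), 1 \le i \le T\}$ and let $\{\theta_{j,k}\}$ be the true
wavelet coefficients of $f$. Then
$|\theta'_{j,k} - \theta_{j,k}| \le CT^{-d}\,2^{-j/2}$ and consequently
$\sum_{j=j_0}^{J_*}\sum_k (\theta'_{j,k} - \theta_{j,k})^2 \le CT^{-2d}$.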

6.3.1. Global adaptation: proof of Theorem 3. Decompose
$E\|\hat f_n - f\|_2^2$ into three terms as follows:

(44) $E\|\hat f_n - f\|_2^2 = \sum_k E(\hat{\tilde\theta}_{j_0,k} - \tilde\theta_{j_0,k})^2 + \sum_{j=j_0}^{J_*-1}\sum_k E(\hat\theta_{j,k} - \theta_{j,k})^2 + \sum_{j=J_*}^{\infty}\sum_k \theta_{j,k}^2 = S_1 + S_2 + S_3.$

It is easy to see that the first term $S_1$ and the third term $S_3$ are small:

(45) $S_1 = 2^{j_0} n^{-1}\epsilon^2 = o(n^{-2\alpha/(1+2\alpha)}).$

Note that for $x \in \mathbb{R}^m$ and $0 < p_1 \le p_2 \le \infty$,

(46) $\|x\|_{p_2} \le \|x\|_{p_1} \le m^{1/p_1 - 1/p_2}\,\|x\|_{p_2}.$

Since $f \in B^\alpha_{p,q}(M)$, we have
$2^{js}(\sum_{k=1}^{2^j} |\theta_{j,k}|^p)^{1/p} \le M$. Now (46) yields that

(47) $S_3 = \sum_{j=J_*}^{\infty}\sum_k \theta_{j,k}^2 \le C\,2^{-2J_*(\alpha \wedge (\alpha + 1/2 - 1/p))}.$

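Propositions 2(ii) and 3 and Lemma 3 together yield

(48)
    $S_2 \le 2\sum_{j=j_0}^{J_*-1}\sum_k E(\hat\theta_{j,k} - \theta'_{j,k})^2 + 2\sum_{j=j_0}^{J_*-1}\sum_k (\theta'_{j,k} - \theta_{j,k})^2$
    $\quad \le \sum_{j=j_0}^{J_*-1}\sum_{i=1}^{2^j/L} \min\left\{8\sum_{(j,k)\in B_j^i} \theta_{j,k}^2,\ 8\lambda_* Ln^{-1}\right\} + 6\sum_{j=j_0}^{J_*-1}\sum_k \epsilon_{j,k}^2 + Cn^{-1} + 10\sum_{j=j_0}^{J_*-1}\sum_k (\theta'_{j,k} - \theta_{j,k})^2$
    $\quad \le \sum_{j=j_0}^{J_*-1}\sum_{i=1}^{2^j/L} \min\left\{8\sum_{(j,k)\in B_j^i} \theta_{j,k}^2,\ 8\lambda_* Ln^{-1}\right\} + Cn^{-1} + CT^{-2d}.$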



We now divide into two cases. First consider the case $p \ge 2$. Let
$J_1 = [\frac{1}{1+2\alpha}\log_2 n]$. So $2^{J_1} \approx n^{1/(1+2\alpha)}$.
Then (48) and (46) yield

    $S_2 \le 8\lambda_*\sum_{j=j_0}^{J_1-1}\sum_{i=1}^{2^j/L} Ln^{-1} + 8\sum_{j=J_1}^{J_*-1}\sum_k \theta_{j,k}^2 + Cn^{-1} + CT^{-2d} \le Cn^{-2\alpha/(1+2\alpha)}.$

By combining this with (45) and (47), we have
$E\|\hat f_n - f\|_2^2 \le Cn^{-2\alpha/(1+2\alpha)}$ for $p \ge 2$.

Now let us consider the case $p < 2$. First we state the following lemma
without proof.

Lemma 4. Let $0 < p < 1$ and
$S = \{x \in \mathbb{R}^k : \sum_{i=1}^{k} x_i^p \le B,\ x_i \ge 0,\ i = 1, \ldots, k\}$.
Then for $A > 0$, $\sup_{x\in S} \sum_{i=1}^{k} (x_i \wedge A) \le B \cdot A^{1-p}$.
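
Lemma 4 rests on the elementary fact that $x \wedge A \le x^p A^{1-p}$ for
$x \ge 0$ and $0 < p < 1$. A quick numerical check, written as a hypothetical
helper pair rather than anything taken from the paper, is:

```python
def truncated_sum(x, A):
    """Left-hand side of Lemma 4: the sum of min(x_i, A)."""
    return sum(min(xi, A) for xi in x)


def lemma4_bound(x, A, p):
    """Right-hand side B * A^(1-p), with B taken as the sum of x_i^p."""
    B = sum(xi ** p for xi in x)
    return B * A ** (1.0 - p)


if __name__ == "__main__":
    x = [0.0, 0.5, 2.0, 10.0]
    # Each term satisfies min(x_i, A) <= x_i^p * A^(1-p), so the sums do too.
    print(truncated_sum(x, A=1.0), lemma4_bound(x, A=1.0, p=0.5))
```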
Let $J_2$ be an integer satisfying
$2^{J_2} \asymp n^{1/(1+2\alpha)}(\log n)^{(2-p)/(p(1+2\alpha))}$. Note that

    $\sum_{i=1}^{2^j/L}\left(\sum_{(j,k)\in B_j^i} \theta_{j,k}^2\right)^{p/2} \le \sum_{k=1}^{2^j} (\theta_{j,k}^2)^{p/2} \le M\,2^{-jsp}.$

It then follows from Lemma 4 that

(49) $\sum_{j=J_2}^{J_*-1}\sum_{i=1}^{2^j/L} \min\left\{8\sum_{(j,k)\in B_j^i} \theta_{j,k}^2,\ 8\lambda_* Ln^{-1}\right\} \le Cn^{-2\alpha/(1+2\alpha)}(\log n)^{(2-p)/(p(1+2\alpha))}.$
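On the other hand,

(50) $\sum_{j=j_0}^{J_2-1}\sum_{i=1}^{2^j/L} \min\left\{8\sum_{(j,k)\in B_j^i} \theta_{j,k}^2,\ 8\lambda_* Ln^{-1}\right\} \le \sum_{j=j_0}^{J_2-1}\sum_{i=1}^{2^j/L} 8\lambda_* Ln^{-1} \le Cn^{-2\alpha/(1+2\alpha)}(\log n)^{(2-p)/(p(1+2\alpha))}.$

We finish the proof for the case $p < 2$ by putting (45), (47), (49) and (50)
together:

    $E\|\hat f_n - f\|_2^2 \le Cn^{-2\alpha/(1+2\alpha)}(\log n)^{(2-p)/(p(1+2\alpha))}.$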
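
6.4. Proof of Theorem 5. Recall that

    $\hat Q = \sum_{k=1}^{2^{j_0}} (\tilde y_{j_0,k}^2 - \hat\sigma_n^2) + \sum_{j=j_0}^{J_q}\sum_{k=1}^{2^j} (y_{j,k}^2 - \hat\sigma_n^2)$

and note that the empirical wavelet coefficients can be written as

    $y_{j,k} = \theta'_{j,k} + \epsilon_{j,k} + \sigma_n z_{j,k} + \xi_{j,k}.$

Since
$(\sum_{j>J_q}\sum_k \theta_{j,k}^2)^2 \le C\,2^{-4J_q(\alpha \wedge (\alpha + 1/2 - 1/p))} = o(\frac{1}{n})$
[cf. (47)], as in Cai and Low (2005) it is easy to show that

    $E\left\{\sum_{k=1}^{2^{j_0}} [(\tilde\theta_{j_0,k} + \sigma_n\tilde z_{j_0,k})^2 - \sigma_n^2] + \sum_{j=j_0}^{J_q}\sum_{k=1}^{2^j} [(\theta_{j,k} + \sigma_n z_{j,k})^2 - \sigma_n^2] - Q(f)\right\}^2 \le 4\sigma_n^2 M^2(1 + o(1)).$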
The theorem then follows easily from the facts below:



Jq 2 J 




) 



£^+EEs><c^ 2(Q - 1/p) 



fc=l 3=30 k =l 




E[^i{al-al)f = o(^. 